Test changes systematically

Improving performance is easier if you can measure it.

In some cases, a modification to a prompt will achieve better performance on a few isolated examples but lead to worse overall performance on a more representative set of examples.

Note: a change may improve a few examples yet hurt performance overall.

Therefore, to be sure that a change is net positive to performance, it may be necessary to define a comprehensive test suite (also known as an "eval").

Note: to confirm that a change is net positive for performance, it may be necessary to define a comprehensive test suite.
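A minimal sketch of such a test suite, assuming each prompt variant is wrapped in a plain Python callable. The names `EvalCase`, `accuracy`, `baseline`, and `candidate` are hypothetical, and the two stub functions stand in for real model calls.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class EvalCase:
    question: str
    expected: str


def accuracy(generate: Callable[[str], str], cases: Sequence[EvalCase]) -> float:
    """Fraction of cases whose output matches the expected answer exactly."""
    if not cases:
        raise ValueError("the test suite is empty")
    correct = sum(generate(c.question).strip() == c.expected for c in cases)
    return correct / len(cases)


# Hypothetical stand-ins for two prompt variants; a real suite would call a model.
def baseline(question: str) -> str:
    return {"2+2": "4", "3*3": "9"}.get(question, "unknown")


def candidate(question: str) -> str:
    return {"2+2": "4", "10/2": "5"}.get(question, "unknown")


CASES = [
    EvalCase("2+2", "4"),
    EvalCase("3*3", "9"),
    EvalCase("10/2", "5"),
    EvalCase("7-7", "0"),
]

if __name__ == "__main__":
    print(f"baseline:  {accuracy(baseline, CASES):.2f}")
    print(f"candidate: {accuracy(candidate, CASES):.2f}")
```

In this toy data the candidate fixes the "10/2" case but breaks "3*3", so its overall score does not improve. Looking only at the one fixed example would have suggested otherwise.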

https://platform.openai.com/docs/guides/prompt-engineering/strategy-test-changes-systematically

Looking at a few examples may hint at which is better, but with small sample sizes it can be hard to distinguish between a true improvement and random luck.

There is a table of sample sizes (95% confidence interval).

With 10 or fewer examples, a 30% difference can be detected.

IMO: does this mean a result could just be a fluke?
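One rough way to see why small samples are unreliable is to compute how many examples are needed before the 95% margin of error on a pass rate shrinks to a given size. The sketch below assumes a normal approximation and a worst-case pass rate of 0.5; `samples_for_margin` is a hypothetical helper, and this is not necessarily how the table in the guide was derived.

```python
import math

Z_95 = 1.96  # two-sided z value for 95% confidence


def samples_for_margin(margin: float, p: float = 0.5) -> int:
    """Approximate sample size so the 95% margin of error on a pass rate is `margin`.

    Uses the normal approximation n = z^2 * p * (1 - p) / margin^2.
    p = 0.5 is the worst case (largest variance), assumed here by default.
    """
    if not 0 < margin < 1:
        raise ValueError("margin must be between 0 and 1")
    if not 0 < p < 1:
        raise ValueError("p must be between 0 and 1")
    return math.ceil(Z_95 ** 2 * p * (1 - p) / margin ** 2)


if __name__ == "__main__":
    for margin in (0.30, 0.10, 0.03, 0.01):
        print(f"margin {margin:.0%}: about {samples_for_margin(margin)} examples")
```

Under these assumptions, a margin of 30% needs roughly 10 examples and each tenfold reduction in the margin needs roughly 100 times more, which is why a handful of examples can only separate very large differences from chance.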

Evaluation procedures (or "evals") are useful for optimizing system designs.

Three properties of good evals

Evaluation of outputs can be done by computers, humans, or a mix.

Computers can automate evals with objective criteria (e.g., questions with single correct answers) as well as some subjective or fuzzy criteria, in which model outputs are evaluated by other model queries.
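The split between objective and fuzzy criteria can be sketched as a single grading function. This is only an illustration: `normalize`, `grade`, and `polite_judge` are hypothetical names, and the judge here is a keyword check standing in for what would really be another model query.

```python
from typing import Callable, Optional


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def grade(
    output: str,
    expected: Optional[str],
    judge: Optional[Callable[[str], bool]] = None,
) -> bool:
    """Use exact match when a single correct answer exists, otherwise defer to a judge."""
    if expected is not None:
        return normalize(output) == normalize(expected)
    if judge is None:
        raise ValueError("a fuzzy criterion needs a judge callable")
    return judge(output)


# Hypothetical judge: a real one would send the output to another model query.
def polite_judge(output: str) -> bool:
    lowered = output.lower()
    return "please" in lowered or "thank" in lowered


if __name__ == "__main__":
    print(grade("  Paris ", "paris"))                        # objective criterion
    print(grade("Thank you for waiting.", None, polite_judge))  # fuzzy criterion
```

Keeping the judge as a plain callable makes it easy to swap the keyword check for a model-based grader later and to compare its verdicts against human labels.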

OpenAI Evals

We encourage experimentation to figure out how well model-based evals can work for your use case.

Evaluate model outputs with reference to gold-standard answers